Project - Unsupervised Learning

Part One

1. Import and warehouse data:

The data, containing nine attributes, is stored in two separate files (Car name.csv and Car Attributes.json). This is often the case with real-world data, which may be stored in multiple formats in multiple places. So we read the two files into dataframes and concatenate them into a single dataframe.
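As a sketch, the two files can be read and combined with pandas. The tiny stand-in files written below (and their column names) are assumptions so the snippet runs end-to-end; the real notebook reads the project's actual Car name.csv and Car Attributes.json.

```python
import pandas as pd

# Tiny stand-in files so the sketch runs end-to-end
# (the real project files are Car name.csv and Car Attributes.json).
pd.DataFrame({"car_name": ["ford pinto", "vw rabbit"]}).to_csv("Car name.csv", index=False)
pd.DataFrame({"mpg": [23.0, 29.0], "cyl": [4, 4]}).to_json("Car Attributes.json")

names = pd.read_csv("Car name.csv")
attrs = pd.read_json("Car Attributes.json")

# The two files describe the same rows in the same order,
# so concatenate column-wise into a single dataframe.
df = pd.concat([names, attrs], axis=1)
```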

2. Data cleansing:

To predict the mpg of a car, car_name (which is unique for each car) adds no information to the clustering task, so we drop the car_name variable.
We impute the 6 missing values with nearest-neighbour imputation.
yr, origin, and cyl are categorical variables, but pandas guessed their data types wrong, so we convert them to categorical.
We also convert hp from object to int.
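The cleansing steps above can be sketched as follows. The toy frame, the "?" placeholder, and the column subset passed to the imputer are assumptions for illustration.

```python
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "car_name": ["a", "b", "c", "d"],
    "hp": ["130", "165", "?", "150"],   # object dtype with a missing-value placeholder
    "wt": [3504, 3693, 3436, 3433],
    "yr": [70, 70, 71, 71],
})

# car_name is unique per row, so it carries no clustering signal -- drop it.
df = df.drop(columns="car_name")

# hp arrives as object; coerce the placeholder to NaN, then impute with nearest neighbours.
df["hp"] = pd.to_numeric(df["hp"], errors="coerce")
df[["hp", "wt"]] = KNNImputer(n_neighbors=2).fit_transform(df[["hp", "wt"]])
df["hp"] = df["hp"].astype(int)

# yr (and similarly origin, cyl) are categorical despite their numeric look.
df["yr"] = df["yr"].astype("category")
```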

3. Data analysis & visualisation:

All the variables have different ranges (with maxima ranging up to 5140), so scaling the variables is a must before fitting models.
We can infer that mpg, acc, and yr seem to follow a normal distribution (median ~ mean).
Other variables have non-standard distributions.
The range of our target mpg is between 9 and 46.6.
From the report, we have the following observations:
For the fields (hp, cyl, wt), we have to adopt the strategy of either choosing only the one variable most correlated with the target variable, or computing principal components, which preserves the explained variance but loses some degree of interpretability.
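A minimal sketch of the scaling and variable-selection strategy, on an illustrative toy frame (all values below are made up):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame standing in for the car data (values are illustrative only).
df = pd.DataFrame({
    "mpg": [18.0, 15.0, 36.0, 28.0],
    "hp":  [130, 165, 60, 90],
    "wt":  [3504, 3693, 1800, 2625],
    "acc": [12.0, 11.5, 19.0, 15.0],
})

# Ranges differ by orders of magnitude, so standardise before distance-based models.
X = StandardScaler().fit_transform(df[["hp", "wt", "acc"]])

# Among correlated fields, keep the one most correlated with the target mpg.
corr_with_target = df.corr()["mpg"].drop("mpg").abs()
best = corr_with_target.idxmax()
```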
Univariate plots
From the above plots:
Multivariate plots

4. Machine learning Modelling:

Before modeling, we will do some data preprocessing:

K-Means

Based on the above plot, 5 seems to be a reasonable choice for the number of clusters.
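A sketch of the elbow procedure behind that choice, run on synthetic blobs rather than the project data:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Synthetic stand-in: 5 well-separated blobs, mirroring the k=5 read off the elbow plot.
X, _ = make_blobs(n_samples=300, centers=5, cluster_std=0.6, random_state=42)

# Inertia (within-cluster sum of squares) for a range of k; the "elbow" marks a good k.
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in range(2, 9)}

labels = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(X)
```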

Observations:

Hierarchical

Both clustering techniques found clusters with similar counts, which is promising.
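One way to compare the two techniques' cluster sizes, again on synthetic blobs (cluster labels are arbitrary, so only the sorted counts are comparable):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering

X, _ = make_blobs(n_samples=300, centers=5, cluster_std=0.6, random_state=0)

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
hc = AgglomerativeClustering(n_clusters=5, linkage="ward").fit_predict(X)

# Compare cluster sizes; labels are arbitrary, so sort the counts before comparing.
km_sizes = sorted(np.bincount(km))
hc_sizes = sorted(np.bincount(hc))
```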

Hierarchical

K-Means

5. Observations based on outcomes of using ML based methods.

The optimal number of clusters is subjective from a modeling perspective, as it is driven by the business objective and the actual patterns in the data. From that perspective, there seem to be 3 distinct clusters with:

Use linear regression model on different clusters separately and print the coefficients of the models individually
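A minimal sketch of the per-cluster regression, using two synthetic "clusters" with deliberately different slopes (all numbers are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Two synthetic "clusters" where the target depends on the features with different slopes.
X1 = rng.normal(size=(100, 2))
y1 = 30 - 5.0 * X1[:, 0] + 1.0 * X1[:, 1]
X2 = rng.normal(size=(100, 2))
y2 = 20 + 2.0 * X2[:, 0] - 4.0 * X2[:, 1]

# Fit one linear model per cluster and print its coefficients.
for name, (X, y) in {"cluster_0": (X1, y1), "cluster_1": (X2, y2)}.items():
    model = LinearRegression().fit(X, y)
    print(name, model.coef_.round(2), round(model.intercept_, 2))
```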

How using different models for different clusters will be helpful in this case and how it will be different than using one single model without clustering? Mention how it impacts performance and prediction.

As we can clearly see, each cluster has widely different coefficients for the same variables when a linear model is fitted on its sub-dataset to predict mpg. This indicates that the relationship between the target variable mpg and the predictive features differs from cluster to cluster. Hence, it is essential to train a separate model per cluster to make correct predictions and to interpret them appropriately within each cluster.
The most important features for predicting mpg also differ for each cluster.
A single model across all the clusters will fail to capture the differences between clusters and will naively fit overall data patterns to make predictions. Such a model will underfit and suffer from low accuracy, although its prediction latency would be better, since the cluster need not be determined for a new data point before loading the appropriate cluster model to make the final prediction.

6. Improvisation:

• Detailed suggestions or improvements on the quality, quantity, variety, velocity, veracity, etc. of the data points collected by the company, to enable better data analysis in future.

The company has to collect much more data (very few data points are used now) if the analysis is to be applicable and useful in real life.
The data needs to be properly collected and vetted so that it is complete, with no wrong or missing values.
We would also need more attributes for each car, such as a car_type (electric, sport, luxury, muscle, green, etc.), to aid our analysis and improve clustering.

Part Two

We have 18 missing entries for the Quality field. So we can train a clustering model on the dataset and, based on the cluster assignments, impute or fill in the missing Quality values for the company.
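One possible implementation of that imputation idea, on a toy frame (the feature names and values below are assumptions): cluster on the observed features, then fill each missing Quality with the most frequent Quality inside its cluster.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy frame: Quality is missing for some rows (names and values are illustrative).
df = pd.DataFrame({
    "feat1": [1.0, 1.1, 0.9, 9.0, 9.2, 8.8],
    "feat2": [2.0, 2.1, 1.9, 7.0, 7.1, 6.9],
    "Quality": ["good", "good", None, "bad", None, "bad"],
})

# Cluster on the numeric features only.
df["cluster"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    df[["feat1", "feat2"]])

# Most frequent observed Quality per cluster, used to fill the gaps.
mode_by_cluster = (df.dropna(subset=["Quality"])
                     .groupby("cluster")["Quality"]
                     .agg(lambda s: s.mode()[0]))
df["Quality"] = df["Quality"].fillna(df["cluster"].map(mode_by_cluster))
```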

Hence, we are able to make a 48% accurate quality classification based on just the two clusters.

Part Three

1. Data: Import, clean and pre-process the data

2. EDA and visualisation: Create a detailed performance report using univariate, bi-variate and multivariate EDA techniques. Find out all possible hidden patterns by using all possible methods.

From the associations tab, the correlation heatmap shows that multiple variables are highly correlated with each other. Hence, PCA may be beneficial here.

3. Classifier: Design and train a best-fit SVM classifier using all the data attributes.

4. Dimensional reduction: perform dimensional reduction on the data.

Let's use 3 principal components, as the remaining components add only a negligible amount of explained variance.
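A sketch of reading the number of components off the cumulative explained variance, on synthetic collinear data (8 columns driven by 3 latent factors, so 3 components suffice by construction):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic collinear data: 8 columns driven by 3 latent factors plus tiny noise.
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 8)) + 0.01 * rng.normal(size=(200, 8))

pca = PCA().fit(StandardScaler().fit_transform(X))
cum = np.cumsum(pca.explained_variance_ratio_)

# The first 3 components explain almost all the variance; keep those.
X3 = PCA(n_components=3).fit_transform(StandardScaler().fit_transform(X))
```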

5. Classifier: Design and train a best-fit SVM classifier using the dimensionally reduced attributes.
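A hedged sketch of the two classifiers side by side, using a public scikit-learn dataset as a stand-in for the project data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# A public dataset standing in for the project data.
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

# SVM on all attributes vs. SVM on a handful of principal components.
full = make_pipeline(StandardScaler(), SVC()).fit(X_tr, y_tr)
reduced = make_pipeline(StandardScaler(), PCA(n_components=3), SVC()).fit(X_tr, y_tr)

print(full.score(X_te, y_te), reduced.score(X_te, y_te))
```

The reduced pipeline trades a little accuracy for a much smaller feature space, which is the trade-off the conclusion section discusses.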

6. Conclusion: Showcase key pointers on how dimensional reduction helped in this case.

Summary:

Part Four

1. EDA and visualisation: Create a detailed performance report using univariate, bi-variate and multivariate EDA techniques. Find out all possible hidden patterns by using all possible methods.

2. Build a data driven model to rank all the players in the dataset using all or the most important performance features.

We can make the following ranks based on the 4 clusters identified by KMeans

Group 2: Rank 1 | Highest hitters with maximum runs, ave, sr, fours, sixes and HF

Group 0: Rank 2 | Ranked second on average in all metrics

Group 3: Rank 3 | Ranked third on average in all metrics

Group 1: Rank 4 | Lowest hitters with the least runs, ave, sr, fours, sixes and HF
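The ranking logic can be sketched like this, on toy batting stats (the numbers, and the choice of runs as the ordering metric, are assumptions for illustration):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy batting stats (column names mirror the report: runs, ave, sr).
df = pd.DataFrame({
    "runs": [5200, 4900, 2100, 2300, 600, 700, 3400, 3600],
    "ave":  [52.0, 49.0, 28.0, 30.0, 12.0, 14.0, 38.0, 40.0],
    "sr":   [140, 135, 110, 112, 85, 88, 125, 128],
})

df["group"] = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(df))

# Rank the clusters by their mean performance: the best cluster gets rank 1.
order = df.groupby("group")["runs"].mean().sort_values(ascending=False).index
rank_of_group = {g: r for r, g in enumerate(order, start=1)}
df["rank"] = df["group"].map(rank_of_group)
```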

These clusters, via the resulting ranking, should help sports management companies take business decisions.

The cricketers are thus segregated into ranks for the use of the sports management company.

Part Five

List down all possible dimensionality reduction techniques that can be implemented using python.

Dimensionality reduction is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data, ideally close to its intrinsic dimension. Working in high-dimensional spaces can be undesirable for many reasons; raw data are often sparse as a consequence of the curse of dimensionality, and analyzing the data is usually computationally intractable. Dimensionality reduction is common in fields that deal with large numbers of observations and/or large numbers of variables, such as signal processing, speech recognition, neuroinformatics, and bioinformatics.

Dimensionality Reduction techniques that can be implemented in python:

Sources: 1, 2, 3
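As a small illustration, a few of the techniques commonly listed for Python (PCA and truncated SVD as linear methods, t-SNE as a non-linear manifold method), applied via scikit-learn to a subset of the digits dataset:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)   # 64-dimensional digit images
X = X[:500]                           # subset to keep the example fast

# Linear reduction: PCA and truncated SVD down to 2 dimensions.
X_pca = PCA(n_components=2).fit_transform(X)
X_svd = TruncatedSVD(n_components=2).fit_transform(X)

# Non-linear (manifold) reduction: t-SNE down to 2 dimensions.
X_tsne = TSNE(n_components=2, init="pca", random_state=0).fit_transform(X)
```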

So far you have used dimensional reduction on numeric data. Is it possible to do the same on a multimedia data [images and video] and text data ? Please illustrate your findings using a simple implementation on python.

Hence, we can see that the image colours and contours are retained and the quality degradation is not too severe, although this isn't lossless compression: some information is lost when the principal components are computed.

The more principal components are used, the higher the quality and the more information is preserved, depending on how many dimensions we want to remove. The transformed data can be used for deep-learning-based image classification/recognition, or for other models that work on images, without the curse of dimensionality. Similarly, dimensionality reduction can be employed for other types of data.
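A minimal sketch of PCA image compression and reconstruction, on a synthetic low-rank "image" rather than a real photo:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# A synthetic 64x64 grayscale "image" standing in for a real photo
# (built with rank 8, so 8 components can reconstruct it almost exactly).
img = rng.normal(size=(64, 8)) @ rng.normal(size=(8, 64))

# Treat rows as samples and columns as features; keep a few principal components.
pca = PCA(n_components=8).fit(img)
compressed = pca.transform(img)                # 64 x 8 instead of 64 x 64
reconstructed = pca.inverse_transform(compressed)

# Reconstruction is lossy in general: error shrinks as more components are kept.
err = np.abs(img - reconstructed).mean()
```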